Characterizing Web Spam Using Content and HTTP Session Analysis
Abstract
Web spam research has been hampered by a lack of statistically significant collections. In this paper, we perform the first large-scale characterization of web spam using content and HTTP session analysis techniques on the Webb Spam Corpus – a collection of about 350,000 web spam pages. Our content analysis results are consistent with the hypothesis that web spam pages are different from normal web pages, showing far more duplication of physical content and URL redirections. An analysis of session information collected during the crawling of the Webb Spam Corpus shows significant concentration of hosting IP addresses in two narrow ranges as well as significant overlaps among session header values. These findings suggest that content and HTTP session analysis may contribute a great deal towards future efforts to automatically distinguish web spam pages from normal web pages.
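The two measurements named in the abstract, duplication of page content and concentration of hosting IP addresses, can be illustrated with a minimal sketch. This is not the paper's method: it assumes exact-match duplication detected by hashing page bodies and a simple count of IPv4 addresses by leading octet, and the function names and sample data are hypothetical.

```python
import hashlib
from collections import Counter


def duplicate_fraction(pages):
    """Return the fraction of pages whose body exactly repeats another page.

    Assumption: "duplication" means byte-identical bodies, detected by
    comparing SHA-256 digests. The first copy of each body is not counted.
    """
    counts = Counter(
        hashlib.sha256(body.encode("utf-8")).hexdigest() for body in pages
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every page beyond the first in each digest group is a duplicate.
    return (total - len(counts)) / total


def ip_prefix_histogram(ips, octets=1):
    """Count IPv4 addresses by their leading `octets` octets."""
    return Counter(".".join(ip.split(".")[:octets]) for ip in ips)


# Hypothetical sample data, not drawn from the Webb Spam Corpus.
sample_pages = ["buy now", "buy now", "cheap pills", "buy now"]
sample_ips = ["192.0.2.7", "192.0.2.9", "198.51.100.4"]

print(duplicate_fraction(sample_pages))
print(ip_prefix_histogram(sample_ips).most_common())
```

A real study would likely also need near-duplicate detection and range bucketing finer than whole octets; the sketch only shows the shape of the two counts.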
Similar resources
A Perspective of Evolution After Five Years: A Large-Scale Study of Web Spam Evolution
Identifying and detecting web spam is an ongoing battle between spam researchers and spammers, one that has run from the time search engines first allowed searching of web pages to the modern sharing of web links via social networks. A common challenge faced by spam researchers is that new techniques require a corpus of legitimate and spam web pages. Although large corpora of legit...
Detecting Content Spam on the Web through Text Diversity Analysis
Web spam is considered to be one of the greatest threats to modern search engines. Spammers use a wide range of content generation techniques known as content spam to fill search results with low quality pages. We argue that content spam must be tackled using a wide range of content quality features. In this paper we propose a set of content diversity features based on frequency rank distributi...
Characterizing the Splogosphere
Weblogs or blogs collectively constitute the Blogosphere, forming an influential and interesting subset on the Web. As with most Internet-enabled applications, the ease of content creation and distribution makes the blogosphere spam prone. Spam blogs or splogs are blogs hosting spam posts, created using machine generated or hijacked content for the sole purpose of hosting ads or raising the Pag...
A structural, content-similarity measure for detecting spam documents on the web
Purpose: The Web provides its users with abundant information. Unfortunately, when a Web search is performed, both users and search engines must deal with an annoying problem: the presence of spam documents that are ranked among legitimate ones. The mixed results downgrade the performance of search engines and frustrate users who are required to filter out useless information. To improve the qua...
Search Engine Click Spam Detection Based on Bipartite Graph Propagation
Using search engines to retrieve information has become an important part of people’s daily lives. For most search engines, click information is an important factor in document ranking. As a result, some websites cheat to obtain a higher rank by fraudulently increasing clicks to their pages, which is referred to as “Click Spam”. Based on an analysis of the features of fraudulent clicks, a novel...